An Introduction to Machine Learning with Scikit-learn

By Michelle Lochner

First, we need to import some tools from scikit-learn and the classifiers we're going to use.


In [1]:
from __future__ import division, print_function
from sklearn.datasets import make_circles

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve, roc_auc_score

from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeRegressor

import numpy as np
import time
import matplotlib.pyplot as plt
%matplotlib nbagg

In [2]:
def plot_roc(fpr, tpr):
    """
    Simple ROC curve plotting function.
    
    Parameters
    ----------
    fpr : array
        False positive rate
    tpr : array
        True positive rate
    """
    
    plt.plot(fpr, tpr, lw=1.5)
    plt.xlim([-0.05,1.05])
    plt.ylim([-0.05,1.05])
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')

Generating data

Scikit-learn has several built-in datasets as well as functions to generate random data. We'll use one of these for this example, but it's straightforward to put in your own data.


In [3]:
# X is the array of features, y is the array of corresponding class labels
X, y = make_circles(n_samples=1000, noise=0.1, random_state=0)

To avoid overfitting, we split the data into a training set, used to train the algorithm, and a test set, used to evaluate its performance. There's no hard and fast rule about how big your training set should be, as this is highly problem-dependent. Here, we'll use 70% of the data as training data.


In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)

You should always rescale your features, as many algorithms (including SVM and many neural network implementations) assume the features have zero mean and unit variance, and will likely underperform without scaling. In this example, the generated data are already scaled so it's unnecessary, but I leave this in to show you how it's done.


In [5]:
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Now we can have a look at our training data, where I've coloured the points by the class they belong to.


In [6]:
plt.figure()
plt.plot(X_train[y_train==0,0], X_train[y_train==0,1],'.')
plt.plot(X_train[y_train==1,0], X_train[y_train==1,1],'.')
plt.legend(('Class 0', 'Class 1'))


Out[6]:
<matplotlib.legend.Legend at 0x7f761e5399b0>

Classification

Let's start classifying! Scikit-learn is fully object-oriented, so each classifier is an object. While there are dozens of different algorithms available, they all implement the same interface (the same fit and predict methods), which makes it very easy to swap classifiers in and out.

Here we create a K-nearest neighbours classifier object, which has several hyperparameters that are set to default values.


In [7]:
clf = KNeighborsClassifier()
print(clf)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

This is the function that actually trains the classifier with our training data.


In [8]:
clf.fit(X_train, y_train)


Out[8]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

Now that the classifier is trained, we can use it to predict the classes of our test data and have a look at the accuracy.


In [9]:
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))


0.836666666667

But accuracy is only part of the story, so let's extract the probability of belonging to each class so that we can generate the ROC curve.


In [10]:
probs = clf.predict_proba(X_test)

In [11]:
fpr, tpr, thresh = roc_curve(y_test, probs[:,1], pos_label=1)
auc = roc_auc_score(y_test, probs[:,1])
print('Area under curve', auc)


Area under curve 0.898042269188

In [12]:
plt.figure()
plot_roc(fpr, tpr)


Optimising hyperparameters

The K-nearest neighbours classifier has several hyperparameters that we've so far left at their default values. If we optimise these instead, we can get a better result. The most robust way to do this is with one of scikit-learn's cross-validation methods; here we'll use GridSearchCV. Naturally, the algorithm now takes much longer to train, as it has to fit and evaluate the model several times.


In [13]:
t1 = time.time()
clf = KNeighborsClassifier()
# Define a grid of parameters over which to search, as a dictionary
params = {'n_neighbors':np.arange(1, 30, 1), 'weights':['distance', 'uniform']}
# cv=5 means we're doing 5-fold cross validation.
clf = GridSearchCV(clf, params, cv=5)
clf.fit(X_train, y_train)
print('Time taken',time.time()-t1,'seconds')


Time taken 1.181649923324585 seconds

In [14]:
# We can see which combination of parameters was best
print(clf.best_params_)


{'weights': 'distance', 'n_neighbors': 16}

Let's see if the accuracy has improved.


In [15]:
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))


0.843333333333

The accuracy is more or less unchanged, but we do get a slightly better ROC curve.


In [16]:
probs = clf.predict_proba(X_test)

In [17]:
fpr_knn, tpr_knn, thresh = roc_curve(y_test, probs[:,1], pos_label=1)
auc_knn = roc_auc_score(y_test, probs[:,1])
print('Area under curve', auc_knn)


Area under curve 0.924226918799

In [18]:
plt.figure()
plot_roc(fpr_knn, tpr_knn)


Using a different algorithm

Scikit-learn is designed to let you easily swap one algorithm out for another. Here we'll try a support vector machine with an RBF kernel.


In [19]:
clf = SVC(kernel='rbf', probability=True)
clf.fit(X_train, y_train)


Out[19]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [20]:
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))


0.86

In [21]:
probs = clf.predict_proba(X_test)
fpr_svm, tpr_svm, thresh = roc_curve(y_test, probs[:,1], pos_label=1)
auc_svm = roc_auc_score(y_test, probs[:,1])
print('Area under curve', auc_svm)


Area under curve 0.939510567297

In [22]:
plt.figure()
plot_roc(fpr_knn, tpr_knn)
plot_roc(fpr_svm, tpr_svm)
plt.legend(('KNN (%.3f)' %auc_knn, 'SVM (%.3f)' %auc_svm), loc='lower right')


Out[22]:
<matplotlib.legend.Legend at 0x7f7617f8cba8>

You can use cross-validation to tune the hyperparameters for the SVM too, but it takes a bit longer than for KNN.
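
If you want to try it, a minimal sketch might look like the following. It isn't run in this notebook, and the grid values for C and gamma are just illustrative guesses, not recommendations.

# Sketch: tune the SVM's C and gamma with 5-fold cross-validation.
# The grid below is an illustrative choice, not a recommendation.
svm_params = {'C': [0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1]}
svm_cv = GridSearchCV(SVC(kernel='rbf', probability=True), svm_params, cv=5)
svm_cv.fit(X_train, y_train)
print(svm_cv.best_params_)
print(accuracy_score(y_test, svm_cv.predict(X_test)))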

Over-fitting

This is a somewhat contrived example to demonstrate over-fitting.

First we make some fake data with a few outliers in it.


In [23]:
np.random.seed(42)
x = np.linspace(-3,3, 100)
y = np.sin(x) + np.random.randn(len(x))*0.05
N = 25
outlier_ints = np.random.randint(0, len(x), N)
y[outlier_ints] += np.random.randn(N)*1
plt.figure()
plt.plot(x,y,'.')
plt.xlabel('x');
plt.ylabel('y');



In [24]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)

y_train = y_train[np.argsort(X_train)]
X_train.sort()
y_test = y_test[np.argsort(X_test)]
X_test.sort()

X_train = X_train[:, None] # sklearn doesn't like 1d X arrays
X_test = X_test[:, None]


dt1 = DecisionTreeRegressor(max_depth=10) # An overly complicated regressor
dt2 = DecisionTreeRegressor(max_depth=3) # A simpler regressor


dt1.fit(X_train, y_train)
dt2.fit(X_train, y_train)

y_train_1 = dt1.predict(X_train)
y_train_2 = dt2.predict(X_train)


y_test_1 = dt1.predict(X_test)
y_test_2 = dt2.predict(X_test)


plt.figure()
plt.plot(x,y,'.')
plt.plot(X_test, y_test_1, lw=1.5, alpha=0.5)
plt.plot(X_test,y_test_2, lw=1.5, alpha=0.5)

plt.xlabel('x')
plt.ylabel('y')
plt.legend(('Data', 'Max depth 10', 'Max depth 3'));


You can see that the more complicated decision tree learns the behaviour of the spurious outliers. It's easy to check for over-fitting: whatever metric you're using (in this case mean squared error) will show much better performance on the training set than on the test set.


In [25]:
mse_train = np.mean((y_train-y_train_1)**2)
mse_test = np.mean((y_test-y_test_1)**2)
mse_train, mse_test


Out[25]:
(0.0001413211084962647, 0.54564383525361615)

In [26]:
mse_train = np.mean((y_train-y_train_2)**2)
mse_test = np.mean((y_test-y_test_2)**2)
mse_train, mse_test


Out[26]:
(0.1934993327434579, 0.13793736239832607)

We'll now use cross-validation to automatically choose the hyperparameters and avoid over-fitting.


In [28]:
dt3 = GridSearchCV(DecisionTreeRegressor(), param_grid={'max_depth': np.arange(2,12)}, cv=5)
dt3.fit(X_train, y_train)

y_train_3 = dt3.predict(X_train)
y_test_3 = dt3.predict(X_test)

mse_train = np.mean((y_train-y_train_3)**2)
mse_test = np.mean((y_test-y_test_3)**2)
mse_train, mse_test


Out[28]:
(0.26459745630740678, 0.20175110292504578)

In [29]:
print(dt3.best_params_)


{'max_depth': 2}

Why does it choose max_depth=2 when we found max_depth=3 gave a better test MSE? Because decision trees are high-variance: they're very sensitive to the exact choice of training and validation set, so our previous result isn't general. This highlights why you need cross-validation, and why decision trees on their own aren't a particularly robust algorithm.
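
As a quick illustration of this variance (a sketch added here, reusing the same regression data with arbitrary seeds), you can refit the max_depth=3 tree on a few different random splits and watch the test MSE move around:

# Sketch: refit the same max_depth=3 regressor on different random splits
# and compare the test MSE across splits. The seeds are arbitrary.
for seed in [0, 1, 2, 3]:
    Xtr, Xte, ytr, yte = train_test_split(x[:, None], y, test_size=0.25, random_state=seed)
    dt = DecisionTreeRegressor(max_depth=3).fit(Xtr, ytr)
    print(seed, np.mean((yte - dt.predict(Xte))**2))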

Commentary on over-fitting

The overly complicated model over-fits the data, incorporating noise into the model. The easiest way to see if you're over-fitting is to compare whatever metric you're using (e.g. mean squared error or AUC) between the training and test sets. Usually these numbers should be fairly similar; if performance is significantly better on the training set, you're over-fitting.

To avoid overfitting:

  1. Always keep a separate test set that you don't touch when fiddling with algorithms
  2. Use cross-validation to optimise hyper-parameters
  3. Use algorithms that are more robust to over-fitting (e.g. Random Forests; see the sketch after the note below)

Note: the above numbers are sensitive to the random seed and exact data set used (change them and see), but on average you'll find that the decision tree with the higher max depth tends to over-fit.
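
To illustrate point 3, here's a minimal sketch using a random forest on the same regression data. RandomForestRegressor isn't imported or used elsewhere in this notebook, so treat it as a starting point rather than a tuned model.

# Sketch: a random forest averages many trees and is usually more robust
# to over-fitting than a single deep decision tree.
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print('Train MSE:', np.mean((y_train - rf.predict(X_train))**2))
print('Test MSE:', np.mean((y_test - rf.predict(X_test))**2))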

Next steps

You can try the same thing with a different algorithm (just look up a few in the scikit-learn documentation), have a look at a regression problem (which works in much the same way), or, for some real fun, try out a different dataset!
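
For example, here's one possible starting point that swaps in a random forest classifier on one of scikit-learn's built-in datasets; these particular choices don't appear elsewhere in this tutorial and are just for illustration.

# Sketch: the same fit/predict workflow with a different algorithm and dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, train_size=0.7, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))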